- Introduction
- Fundamentals and data structures
- Control flow
- Functions
- Reading data and plotting
- Resources
Week 8 Lecture
Type R into the terminal. The window that appears is called the R console. Any command you type into this prompt is interpreted by the R kernel.
R has over 10,000 user-contributed packages on CRAN (The Comprehensive R Archive Network) and many more elsewhere
CRAN “is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R”, see CRAN
To install a package, run install.packages("packagename") directly in R (similar to pip install and conda install, which, however, you run from the terminal)
To load a package, include library("packagename") at the beginning of your script (similar to import)
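A common pattern combines the two steps: install a package only when it is missing, then load it. The sketch below uses the stats package, which ships with R, purely so that it runs without downloading anything; the variable names pkg and loaded are illustrative.

```r
# Name of the package to use; "stats" ships with R, so nothing is downloaded here
pkg <- "stats"

# Install only if the package is not yet available on this machine
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)

# The loaded package now appears on the search path
loaded <- paste0("package:", pkg) %in% search()
print(loaded)
```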
Basic operations in R
Objects in R
Key data structures
Atomic vectors
Lists
Matrices and data frames
| Python | R |
|---|---|
| + | + |
| - | - |
| / | / |
| * | * |
| ** | ^ |
- log(<number>)
- exp(<number>)
- sqrt(<number>)
- mean(<numbers>)
- sum(<numbers>)
| Python | R |
|---|---|
| < | < |
| >= | >= |
| == | == |
| != | != |
| and | & |
| or | \| |
| in | %in% |
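The operators in the two tables can be tried out directly in the console. A short sketch with illustrative variable names:

```r
x <- 7

# Arithmetic: ^ is exponentiation in R (** in Python)
power <- 2^3

# Comparisons return logical values
in_range <- x > 5 & x < 10      # TRUE: both conditions hold
either   <- x < 5 | x == 7      # TRUE: the second condition holds
member   <- x %in% c(1, 3, 7)   # TRUE: 7 is one of the values

# & and | work element by element on vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)  # TRUE FALSE FALSE
print(elementwise)
```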
The assignment operator in R is <-
It assigns the value on the right to the object on the left. It behaves mostly like =, with some subtle differences; use <- in R
The <- notation also emphasises that = is not a mathematical equal sign when it is used for assignment in programming, e.g., x = x + 1 makes no sense as an equation
my_object <- 10
print(my_object)
## [1] 10
my_other_object <- 4
print(my_object - my_other_object)
## [1] 6
One way to think about mutable vs immutable objects is whether they are copied to a new address in memory when modified or kept at their old one
Unlike in Python, most objects in R are copied when modified (with some important exceptions), so they can be called immutable in that sense
One exception is a vector that has only been assigned to a single name: it can be modified in place
It can pay off to study these topics for performance of code regardless of the language. A very good summary for R, which is also the basis of this slide and the next two, can be found here: https://adv-r.hadley.nz/names-values.html
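Another of the exceptions mentioned above is the environment, which is modified in place rather than copied. A minimal sketch contrasting it with an atomic vector (all names are illustrative):

```r
# Atomic vectors behave as if copied on modification
v1 <- c(3, 6, 9)
v2 <- v1
v1[2] <- 4
v2            # still 3 6 9

# Environments are modified in place: both names refer to the same object
e1 <- new.env()
e1$value <- 1
e2 <- e1
e2$value <- 99
e1$value      # 99
```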
library(pryr)
x <- c(3, 6, 9)
x
## [1] 3 6 9
y <- x
address(x)
## [1] "0x7f7fb2293f48"
address(y)
## [1] "0x7f7fb2293f48"
x[2] <- 4
x
## [1] 3 4 9
address(x)
## [1] "0x7f7fb41033f8"
address(y)
## [1] "0x7f7fb2293f48"
In Python, the address of an object can be checked with id():
# Mutable
a = [1,2,3]
id(a)
a[2] = 4
id(a)
# Immutable
b = 42
id(b)
b += 1
id(b)
y <- 10 # For example a numeric vector of length 1
class(y) # Class of the object
## [1] "numeric"
typeof(y) # R's internal type for storage
## [1] "double"
length(y) # Length
## [1] 1
attributes(y) # Metadata (matrices e.g. store their dimensions)
## NULL
names(y) # Names
## NULL
dim(y) # Dimensions
## NULL
x <- 5
ls()
## [1] "my_object" "my_other_object" "x" "y"
rm(x)
ls()
## [1] "my_object" "my_other_object" "y"
rm(list = ls()) # Remove all
“Everything that exists in R is an object” (John Chambers)
However, object oriented programming (OOP) is much less important in the daily use of R than functional programming
Functional programming treats computation as the evaluation of mathematical functions avoiding mutable data
OOP is also more challenging in R as there are multiple OOP systems called S3, R6, S4, etc.
If you would like to learn about object oriented programming in R (e.g. to write packages), see here
| Dimension | Stores homogeneous elements | Stores heterogeneous elements |
|---|---|---|
| 1D | Atomic vector | List |
| 2D | Matrix | Data frame |
| nD | Array | |
http://adv-r.had.co.nz/Data-structures.html
More extensive lists e.g. here: https://cran.r-project.org/doc/manuals/r-release/R-lang.html
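To make the table concrete, the sketch below creates one small object of each kind and inspects it. The object names are illustrative.

```r
atomic_vector <- c(1.5, 2, 3)                          # 1D, one type
a_list        <- list(1.5, "a", TRUE)                  # 1D, mixed types
a_matrix      <- matrix(1:6, nrow = 2, ncol = 3)       # 2D, one type
a_data_frame  <- data.frame(n = 1:2, s = c("a", "b"))  # 2D, one type per column
an_array      <- array(1:24, dim = c(2, 3, 4))         # 3D, one type

str(a_list)       # shows the type of every element
dim(a_matrix)     # 2 3
dim(an_array)     # 2 3 4
```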
| Python class | Closest R class |
|---|---|
| bool | logical |
| int | numeric: integer |
| float | numeric: double |
| str | character |
| list | unnamed list |
| dict | named list (named vector also has key-value structure but can only store one type) |
| tuple | - |
| set | - |
Six types (excluding raw and complex for this lecture)
Integer and double vectors are called numeric vectors
| Example | Type |
|---|---|
| "a", "swc" | character |
| 2L (must add an L at the end to denote integer) | numeric: integer |
| 2, 15.5 | numeric: double |
| TRUE, FALSE | logical |
Special values: NA, NaN, Inf and -Inf
NA denotes a missing value
Inf is infinity. You can have either positive or negative infinity
1 / 0
## [1] Inf
NaN means Not a number. It is an undefined value
0 / 0
## [1] NaN
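These special values can be detected with is.na(), is.nan() and is.finite(). A short sketch (the variable names are illustrative):

```r
values <- c(1, NA, 0 / 0, 1 / 0, -1 / 0)   # 1, NA, NaN, Inf, -Inf

na_flags     <- is.na(values)       # FALSE TRUE TRUE FALSE FALSE (NaN also counts as missing)
nan_flags    <- is.nan(values)      # FALSE FALSE TRUE FALSE FALSE
finite_flags <- is.finite(values)   # TRUE FALSE FALSE FALSE FALSE

# Missing values propagate unless they are removed explicitly
with_na    <- sum(values[1:2])                 # NA
without_na <- sum(values[1:2], na.rm = TRUE)   # 1
```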
Use the c() function to concatenate observations into a vector
char_vec <- c("hello", "world")
print(char_vec)
## [1] "hello" "world"
num_double_vec <- c(5, 4, 100, 7.65)
print(num_double_vec)
## [1] 5.00 4.00 100.00 7.65
logical_vec <- c(TRUE, FALSE, TRUE)
print(logical_vec)
## [1] TRUE FALSE TRUE
identical(1.41, c(1.41))
## [1] TRUE
An empty vector can be created with vector() (by default the mode is logical, but you can define different modes as shown in the examples below) or with character(), numeric(), etc.
vector()
## logical(0)
vector(mode = "character", length = 10) # with a length and type
## [1] "" "" "" "" "" "" "" "" "" ""
character(5) # character vector of length 5, also see numeric(5) and logical(5)
## [1] "" "" "" "" ""
z <- c("my470", "is")
z <- c(z, "fantastic")
z
## [1] "my470" "is" "fantastic"
series <- 1:10
series
## [1] 1 2 3 4 5 6 7 8 9 10
series <- seq(1, 10, by = 0.1)
series
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
## [46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## [91] 10.0
class(series)
## [1] "numeric"
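Indexing starts at 1 in R and at 0 in Python: myvector[1:3] selects the 1st, 2nd, and 3rd elements in R, as mylist[0:3] does in Python
In Python, a string can be sliced in the same way:
letters = "abcdefghijklmnopqrztuv"
print(letters[0:4])
In R, a string is a single element of a character vector, so indexing it this way does not return individual characters:
firstletters <- "abcdefg"
firstletters[1:4]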
## [1] "abcdefg" NA NA NA
substr(firstletters, 1, 3)
## [1] "abc"
Similarly, length() counts the elements of a vector; use nchar() to count the characters in a string
length("London")
## [1] 1
nchar("London")
## [1] 6
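If individual characters are needed, one option is to split the string into a character vector first. A small sketch using base R's strsplit() (the variable names are illustrative):

```r
city <- "London"

# substr() extracts by start and stop position (1-based, both inclusive)
first_three <- substr(city, 1, 3)        # "Lon"

# strsplit() with an empty pattern splits into single characters;
# it returns a list, so [[1]] takes the character vector out
chars <- strsplit(city, "")[[1]]         # "L" "o" "n" "d" "o" "n"
n_chars <- length(chars)                 # 6, same as nchar(city)

# Now ordinary vector indexing works
rejoined <- paste(chars[1:3], collapse = "")   # "Lon"
```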
To subset a vector, use square brackets to index the elements you would like via object[index].
Numerical subsetting
num_double_vec[3]
## [1] 100
num_double_vec[1:2]
## [1] 5 4
x <- c(1, 2, 4)
names(x) <- c("element1", "element2", "element3")
x["element1"]
## element1
##        1
Caveat: Although this looks somewhat like a Python dictionary, recall that vectors can only store single types
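The caveat can be seen by mixing types: an atomic vector coerces all values to one type, whereas a named list keeps each value's own type. A brief sketch with illustrative names:

```r
# Mixing types in c() silently coerces everything to the most general type
mixed <- c(a = 1, b = "two", c = TRUE)
mixed_type <- typeof(mixed)      # "character"
mixed[["a"]]                     # "1", no longer a number

# A named list keeps each value's own type, closer to a Python dict
record <- list(a = 1, b = "two", c = TRUE)
typeof(record$a)                 # "double"
typeof(record$c)                 # "logical"
```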
char_vec <- c("hello", "world")
char_vec
## [1] "hello" "world"
logical_vec <- c(TRUE, FALSE)
logical_vec
## [1] TRUE FALSE
char_vec[logical_vec]
## [1] "hello"
Arithmetic and comparisons on vectors are applied element by element (matrix multiplication uses %*%)
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7] + fib[2:8]
## [1] 2 3 5 8 13 21 34
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_greater_five <- fib > 5
print(fib_greater_five)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
When two vectors differ in length, R recycles the shorter one
x <- c(1, 2, 3)
y <- c(5, 10)
x * y
## Warning in x * y: longer object length is not a multiple of shorter object
## length
## [1] 5 20 15
x <- 1:20
x * c(1, 0) # turns the even numbers to 0
## [1] 1 0 3 0 5 0 7 0 9 0 11 0 13 0 15 0 17 0 19 0
A factor is a special kind of vector
It is similar to a character vector, but each unique element is also associated with a numerical value, which makes it easier to process categorical data
A factor vector can only contain predefined values
factor_vec <- as.factor(c("a", "b", "c", "a", "b", "c"))
factor_vec
## [1] a b c a b c
## Levels: a b c
as.numeric(factor_vec) # how it is processed in the background
## [1] 1 2 3 1 2 3
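The restriction to predefined values can be illustrated by setting the levels explicitly with factor(). The example data and names below are made up for illustration.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)                  # "small" "medium" "large"
codes <- as.integer(sizes)     # 1 3 2 1, following the order of the levels
counts <- table(sizes)         # counts per level

# A value outside the predefined levels becomes NA; R normally warns about
# this, and the warning is suppressed here only to keep the output short
suppressWarnings(sizes[2] <- "huge")
is.na(sizes[2])                # TRUE
```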
A list is a collection of any set of object types; unlike an atomic vector, its elements can have different types
my_list <- list(something = num_double_vec,
another_thing = matrix(data = 1:9, nrow = 3, ncol = 3),
something_else = "my470")
my_list
## $something
## [1] 5.00 4.00 100.00 7.65
##
## $another_thing
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## $something_else
## [1] "my470"
my_list["something_else"]
## $something_else
## [1] "my470"
my_list[3]
## $something_else
## [1] "my470"
class(my_list[3])
## [1] "list"
my_list[["something"]]
## [1] 5.00 4.00 100.00 7.65
my_list[[1]]
## [1] 5.00 4.00 100.00 7.65
class(my_list[[1]])
## [1] "numeric"
my_list$another_thing
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
Note that [[ ]] and $ extract a single element (they do not allow multiple elements to be indexed in one command)
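The difference between single brackets on the one hand and double brackets or $ on the other can be summarised in a few lines. The list below is a made-up example, separate from my_list above.

```r
my_small_list <- list(a = 1:3, b = "text", c = TRUE)

# Single brackets return a list and can select several elements at once
sub_list <- my_small_list[c("a", "c")]
class(sub_list)      # "list"
length(sub_list)     # 2

# Double brackets and $ return one element itself
element <- my_small_list[["a"]]
class(element)       # "integer"
same <- identical(element, my_small_list$a)   # TRUE
```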
A matrix arranges data from a vector into a tabular form; all elements have to be of the same type
Arrays have more than 2 dimensions
my_matrix <- matrix(data = 1:100, nrow = 10, ncol = 10)
my_matrix
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100
A data.frame, in contrast, is a matrix-like R object in which the columns can be of different types
my_data_frame <- data.frame(numbers = num_double_vec,
words = char_vec,
logical = logical_vec)
my_data_frame
##   numbers words logical
## 1    5.00 hello    TRUE
## 2    4.00 world   FALSE
## 3  100.00 hello    TRUE
## 4    7.65 world   FALSE
Subset a matrix or data.frame with integers referring to rows and columns
my_matrix[2, 2]
## [1] 12
my_matrix[2:3, 2:3]
##      [,1] [,2]
## [1,]   12   22
## [2,]   13   23
my_data_frame[, 1]
## [1] 5.00 4.00 100.00 7.65
# Adding some column names to the matrix
colnames(my_matrix) <- letters[1:10]
# Works for matrices and data frames
my_matrix[, "e"]
## [1] 41 42 43 44 45 46 47 48 49 50
my_data_frame[, "numbers"]
## [1] 5.00 4.00 100.00 7.65
my_data_frame[, c("numbers", "words")]
##   numbers words
## 1    5.00 hello
## 2    4.00 world
## 3  100.00 hello
## 4    7.65 world
# Works only with data frame columns
my_data_frame$numbers
## [1] 5.00 4.00 100.00 7.65
Rows and columns can be dropped with the - operator and integers (in combination with the c function if multiple rows are dropped)
my_matrix[-4, -5]
##       a  b  c  d  f  g  h  i   j
## [1,]  1 11 21 31 51 61 71 81  91
## [2,]  2 12 22 32 52 62 72 82  92
## [3,]  3 13 23 33 53 63 73 83  93
## [4,]  5 15 25 35 55 65 75 85  95
## [5,]  6 16 26 36 56 66 76 86  96
## [6,]  7 17 27 37 57 67 77 87  97
## [7,]  8 18 28 38 58 68 78 88  98
## [8,]  9 19 29 39 59 69 79 89  99
## [9,] 10 20 30 40 60 70 80 90 100
my_matrix[-c(2:8), -c(2:8)]
##       a  i   j
## [1,]  1 81  91
## [2,]  9 89  99
## [3,] 10 90 100
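A quick way to see what a negative index does is to inspect the index vector itself. The sketch below uses a plain vector v (an illustrative name) rather than the matrix.

```r
# Negating a range: parentheses (or c()) make - apply to the whole vector
drop_idx <- -c(2:8)                          # -2 -3 -4 -5 -6 -7 -8
same_idx <- identical(drop_idx, -(2:8))      # TRUE

# Without them, -2:8 is read as (-2):8
mixed_idx <- -2:8                            # -2 -1 0 1 2 ... 8

v <- 1:10
kept <- v[drop_idx]                          # 1 9 10

# Mixing negative and positive indices is an error
outcome <- tryCatch(v[mixed_idx], error = function(e) "error")
```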
2:8 creates a vector of the integers 2, … , 8 and the - operator negates these. We wrap the vector in the c function so that - applies to every element: without it, -2:8 would be read as (-2):8
x <- 3
if (x > 4) {
print(24)
} else {
print(17)
}
## [1] 17
With an optional else if part:
x <- 2
y <- 3
if (x < y) {
print(24)
} else if (x > y) {
print(18)
} else {
print(17)
}
## [1] 24
for (i in 1:4) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
character_vector <- c("hello", "world")
for (text in character_vector) {
print(text)
}
## [1] "hello"
## [1] "world"
x <- 1
while (x < 5) {
print(x)
x <- x + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
# For example:
x <- 1:1000
y <- 1:1000
z <- numeric(1000)
for (i in 1:1000) {
z[i] <- x[i]*y[i]
}
# vs:
z <- x*y
# Or:
z <- 0
for (i in 1:1000) {
z <- z + x[i]*y[i]
}
# vs:
z <- x%*%y
Same considerations apply to vectorised operations in numpy
For an in-depth discussion of measuring and improving performance in R: https://adv-r.hadley.nz/perf-measure.html and https://adv-r.hadley.nz/perf-improve.html
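The loop and the vectorised versions of the inner product give the same number, with one detail worth knowing: %*% returns a 1 x 1 matrix rather than a plain number. A small sketch (variable names are illustrative; the inputs are converted to double only to keep the comparison simple):

```r
x <- as.numeric(1:1000)
y <- as.numeric(1:1000)

# Loop version of the inner product
inner_loop <- 0
for (i in seq_along(x)) {
  inner_loop <- inner_loop + x[i] * y[i]
}

# Vectorised versions
inner_sum    <- sum(x * y)   # plain number
inner_matrix <- x %*% y      # 1 x 1 matrix

dim(inner_matrix)                                             # 1 1
same_sum    <- all.equal(inner_loop, inner_sum)               # TRUE
same_matrix <- all.equal(inner_loop, drop(inner_matrix))      # TRUE; drop() removes the dimensions
```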
function_name(parameter_one, parameter_two, ...)
The mean() function: mean(x, na.rm = FALSE)
x is a numeric vector
na.rm is a logical value that indicates whether we’d like to remove missing values (NA). na.rm is set to FALSE by default
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75
my_addition_function <- function(a = 10, b) {
return(a + b)
}
my_addition_function(a = 5, b = 50)
## [1] 55
my_addition_function(3, 4)
## [1] 7
my_addition_function(b = 100)
## [1] 110
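The calls above rely on R matching arguments either by position or by name. The sketch below spells this out with a hypothetical function add_with_default, and also shows what happens when an argument without a default is left out.

```r
add_with_default <- function(a = 10, b) {
  a + b   # the value of the last expression is returned
}

by_position <- add_with_default(1, 2)          # a = 1, b = 2, gives 3
by_name     <- add_with_default(b = 2, a = 1)  # order is irrelevant when named, gives 3
default_a   <- add_with_default(b = 2)         # a falls back to 10, gives 12

# b has no default, so leaving it out raises an error once b is used;
# tryCatch() captures the error message instead of stopping the script
message_text <- tryCatch(add_with_default(),
                         error = function(e) conditionMessage(e))
print(message_text)
```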
my_demo_function <- function(a) {
a <- a * 2
return(a)
}
a <- 1
my_demo_function(a = 20)
## [1] 40
a
## [1] 1
library(tidyverse) # pipe operators are originally from the `magrittr` package by Stefan Milton Bache
x <- c(1,2,3,4,15)
mean(x)
## [1] 5
# Same but with the pipe operator
x %>% mean()
## [1] 5
x <- c("1", "2")
x %>%
as.numeric() %>%
mean() %>%
sqrt()
## [1] 1.224745
# Easier to read than the equivalent nested functions
sqrt(mean(as.numeric(x)))
## [1] 1.224745
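Since version 4.1.0, R also ships a native pipe operator, |>, which needs no package. For simple chains like the one above it can be used in the same way as %>%; a minimal sketch, assuming R 4.1.0 or later:

```r
x <- c("1", "2")

# Native pipe: the left-hand side becomes the first argument of the next call
piped <- x |>
  as.numeric() |>
  mean() |>
  sqrt()

# Equivalent nested call
nested <- sqrt(mean(as.numeric(x)))
same_result <- identical(piped, nested)   # TRUE
```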
The apply family of functions is similar to applymap() or df.apply() in pandas
x <- matrix(1:9, nrow = 3, ncol = 3)
x
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
apply(X = x, MARGIN = 2, FUN = max)
## [1] 3 6 9
Use sapply to apply a function to every element of a vector
Use lapply to apply a function to every element of a list
Two stylised features of functional programming:
First-class functions, i.e. functions that behave like any other data structure. In R, this means that you can do many of the things with a function that you can do with a vector: You can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.
“Pure” functions: The output only depends on the inputs, i.e. if you call it again with the same inputs, you get the same outputs. The function also has no side-effects, like changing the value of a global variable, writing to disk, or displaying to the screen. So, e.g., y <- 4; my_function <- function(x) {return(y + x)} is not a pure function.
Source: https://adv-r.hadley.nz/fp.html
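Both features can be illustrated in a few lines. The sketch below passes functions as arguments, returns a function from a function, and contrasts an impure function with a pure one; all function and variable names are illustrative.

```r
# First-class functions: passed as arguments
roots <- sapply(c(1, 4, 9), sqrt)                  # 1 2 3
sums  <- lapply(list(a = 1:3, b = 4:6), sum)       # list with a = 6 and b = 15

# First-class functions: created inside and returned by another function
make_multiplier <- function(k) {
  function(x) x * k
}
double_it <- make_multiplier(2)
doubled <- double_it(5)                            # 10

# Impure: the result depends on the global variable y
y <- 4
impure_add <- function(x) x + y
before <- impure_add(1)                            # 5
y <- 10
after <- impure_add(1)                             # 11, same input but different output

# Pure: the result depends only on the arguments
pure_add <- function(x, y) x + y
pure_result <- pure_add(1, 4)                      # 5
```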
Of course not all functions in R always return the same output with the same inputs, e.g., runif() depends on the pseudo-random number seed, and write.csv() writes output to disk
Furthermore, Python also has features of both object oriented and functional programming
Yet, the number of pure functions is arguably higher in R than in some other programming languages
Python, following more the OOP approach, has many methods and attributes attached to objects (recall week 5 on classes)
For example, consider R vs. pandas in Python. Let’s assume we have some data contained in a data frame object called “df”
colnames(df) vs. df.columns
nrow(df) vs. df.shape[0]
apply(X = df, MARGIN = 2, FUN = max) vs. df.apply(func=max, axis=0)
my_data <- read.csv(file = "my_file.csv")
my_data is an R data.frame object
my_file.csv is a .csv file with your data
Depending on your R version, you may need the stringsAsFactors = FALSE argument to keep strings as character vectors (this has been the default since R 4.0.0)
For R to find my_file.csv, it will have to be saved in your current working directory
Use getwd() to check your current working directory
Use setwd() to change your current working directory
write.csv(my_data, "my_file.csv")
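Writing and reading can be tried together as a round trip. The sketch below uses a made-up data frame and a temporary file from tempfile(), so that it does not depend on the working directory; all names are illustrative.

```r
toy_data <- data.frame(id = 1:3,
                       label = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

# tempfile() gives a throwaway path, so nothing in the working directory is touched
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row numbers as an extra column
write.csv(toy_data, file = csv_path, row.names = FALSE)

toy_back <- read.csv(file = csv_path, stringsAsFactors = FALSE)
str(toy_back)
round_trip_ok <- all.equal(toy_data, toy_back)   # TRUE
```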
set.seed(123) # set random seed to get replicable results
n <- 1000
x <- rnorm(n) # draw 1000 points from the normal distribution
z <- runif(n) # draw 1000 points from the uniform distribution
g <- sample(letters[1:6], n, replace = TRUE) # sample with replacement
# Set some parameters, including noise
beta1 <- 0.5
beta2 <- 0.3
beta3 <- -0.4
alpha <- 0.3
eps <- rnorm(n, sd = 1)
# Generate data that follows a linear trend
y <- alpha + beta1 * x + beta2 * z + beta3 * (x * z) + eps
# Save data in a data frame
my_data <- data.frame(x = x, y = y, z = z, g = g)
Plots are one of the strengths of R
There are two main frameworks for plotting: base R graphics and ggplot2
In base R, plot(x, y) will give you a scatter plot
plot(my_data$x, my_data$y)
Plots can be customised with further arguments (see ?plot for a full list)
plot(x = my_data$x, y = my_data$y,
xlab = "X variable", # x axis label
ylab = "Y variable", # y axis label
main = "Some scatter plot", # main title
pch = 19, # solid points
cex = 0.5, # smaller points
bty = "n", # remove surrounding box
col = as.factor(my_data$g) # colour by grouping variable
)
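For comparison, a similar scatter plot can be sketched with ggplot2. This assumes the ggplot2 package is installed (the code is skipped otherwise) and uses a small simulated data set, plot_data, made up for this example rather than my_data from above.

```r
# Small simulated data set; the column names mirror those used in the lecture
set.seed(123)
plot_data <- data.frame(x = rnorm(100),
                        g = sample(letters[1:3], 100, replace = TRUE))
plot_data$y <- 0.3 + 0.5 * plot_data$x + rnorm(100)

# Only build the plot if ggplot2 is available on this machine
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot_data, aes(x = x, y = y, colour = g)) +
    geom_point(size = 0.5) +                       # smaller points
    labs(x = "X variable", y = "Y variable",
         title = "Some scatter plot") +
    theme_minimal()
  # print(p) draws the plot on the active graphics device
}
```

In ggplot2 the plot is an object built up from layers with +, whereas the base plot() call draws immediately.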
Programming in R
Applied data science in R
tidyverse: Collection of packages such as tidyr, dplyr, ggplot2, etc.
The tidyverse packages are covered in R for Data Science: https://r4ds.had.co.nz/
data.table: Particularly fast package to process very large datasets
ggplot2
ggplot2 is a flexible tool for visualisation
Book by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen: https://ggplot2-book.org
Great website with ggplot2 sample code for many different types of plots: https://www.r-graph-gallery.com/ggplot2-package.html
Machine learning: packages range from regularised regression (glmnet) to random forest (randomForest) or support vector machines (e1071)
Text analysis: quanteda package
Network analysis: igraph